Scene understanding is an essential and challenging task in computer vision. Because it captures the fundamental graphical structure of an image, the scene graph has received increasing attention as a powerful semantic representation. However, it is difficult to draw a proper scene graph for image retrieval, image generation, and multi-modal applications. Conventional scene graph annotation interfaces are not easy to use for image annotation, and automatic scene graph generation approaches based on deep neural networks are prone to generating redundant content while disregarding details. In this work, we propose SGDraw, a scene graph drawing interface that uses an object-oriented scene graph representation to help users draw and edit scene graphs interactively. In the proposed object-oriented representation, an object together with its attributes and relationships is treated as a structural unit. SGDraw provides a web-based scene graph annotation and generation tool for scene understanding applications. To verify the effectiveness of the proposed interface, we conducted a comparison study against a conventional tool and a user experience study. The results show that SGDraw can help generate scene graphs with richer details and describe images more accurately than traditional bounding box annotations. We believe SGDraw can be useful in various vision tasks, such as image retrieval and generation.
Translated by Google Translate
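The object-oriented representation described in the SGDraw abstract, where an object carries its own attributes and relationships as one structural unit, can be sketched as a small data structure. This is a minimal illustration only: the class names `SceneObject` and `SceneGraph`, the method names, and the example scene are assumptions made here, not SGDraw's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class SceneObject:
    """One structural unit: an object, its attributes, and its outgoing relationships."""
    name: str
    attributes: List[str] = field(default_factory=list)
    # Each relationship is (predicate, name of the target object).
    relationships: List[Tuple[str, str]] = field(default_factory=list)


@dataclass
class SceneGraph:
    """Hypothetical container keyed by object name."""
    objects: Dict[str, SceneObject] = field(default_factory=dict)

    def add_object(self, name, attributes=()):
        self.objects[name] = SceneObject(name, list(attributes))
        return self.objects[name]

    def relate(self, subject, predicate, target):
        # Both endpoints must already exist so the graph stays consistent.
        if subject not in self.objects or target not in self.objects:
            raise KeyError("both objects must be added before relating them")
        self.objects[subject].relationships.append((predicate, target))

    def triples(self):
        """Flatten the graph into (subject, predicate, object) triples."""
        return [
            (obj.name, predicate, target)
            for obj in self.objects.values()
            for predicate, target in obj.relationships
        ]


# Invented example scene: a small brown dog sitting on a red sofa.
graph = SceneGraph()
graph.add_object("dog", ["brown", "small"])
graph.add_object("sofa", ["red"])
graph.relate("dog", "sitting on", "sofa")
print(graph.triples())
```

Grouping attributes and relationships under each object means an edit to one object touches a single unit, which is one plausible reading of why such a representation suits interactive drawing and editing.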
Video super-resolution is one of the most popular tasks on mobile devices, widely used to automatically improve low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, exhibiting low FPS rates and poor power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and ask the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models were evaluated on the MediaTek Dimensity 9000 platform, which has a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with this NPU, demonstrating rates of up to 500 FPS and a power consumption of 0.2 [Watt / 30 FPS]. A detailed description of all models developed in the challenge is provided in this paper.
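Recently, knowledge representation learning (KRL) has been emerging as the state-of-the-art approach to processing queries over knowledge graphs (KGs), wherein KG entities and the query are embedded into a latent space such that the entities that answer the query are embedded close to the query. Yet, despite the intensive research on KRL, most existing studies either focus on homogeneous KGs or assume KG completion tasks (i.e., inference of missing facts), while answering complex logical queries over KGs with multiple aspects (multi-view KGs) remains an open challenge. To bridge this gap, in this paper we present ROMA, a novel KRL framework for answering logical queries over multi-view KGs. ROMA departs from prior work in several major aspects: (i) it models a multi-view KG as a set of overlaying sub-KGs, each corresponding to one view, which subsumes many types of KGs studied in the literature (e.g., temporal KGs); (ii) it supports complex logical queries with varying relation and view constraints (e.g., with complex topology and/or from multiple views); (iii) it scales up to KGs of large size (e.g., millions of facts) and fine-grained views (e.g., dozens of views); (iv) it generalizes to query structures and KG views that are unobserved during training. Extensive empirical evaluation on real-world KGs shows that ROMA significantly outperforms alternative methods.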
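The core idea behind KRL query answering, that answers are the entities embedded closest to the query, can be illustrated with a nearest-neighbour lookup. The sketch below is not the framework from the abstract: the two-dimensional vectors are hand-picked, hypothetical embeddings, and `answer_query` is an illustrative helper name; a real system would learn the embeddings and the query encoder.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def answer_query(query_vec, entity_vecs, top_k=2):
    """Return the names of the top_k entities embedded nearest to the query."""
    ranked = sorted(entity_vecs, key=lambda name: euclidean(query_vec, entity_vecs[name]))
    return ranked[:top_k]


# Hypothetical, hand-picked embeddings (assumption, for illustration only).
entities = {
    "Paris": [0.9, 0.1],
    "Berlin": [0.8, 0.3],
    "Banana": [-0.7, 0.5],
}
query = [1.0, 0.0]  # stands in for an embedded logical query

answers = answer_query(query, entities)
print(answers)
```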
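Graph neural networks (GNNs) have been successfully applied to many real-world static graphs. However, this success has not fully carried over to dynamic graphs because of limitations in model design, evaluation settings, and training strategies. Specifically, existing dynamic GNNs do not incorporate state-of-the-art designs from static GNNs, which limits their performance. Current evaluation settings for dynamic GNNs do not fully reflect the evolving nature of dynamic graphs. Finally, commonly used training methods for dynamic GNNs are not scalable. Here we propose ROLAND, an effective graph representation learning framework for real-world dynamic graphs. At its core, the ROLAND framework helps researchers easily repurpose any static GNN for dynamic graphs. Our insight is to view the node embeddings at different GNN layers as hierarchical node states and then recurrently update them over time. We then introduce a live-update evaluation setting for dynamic graphs that mimics real-world use cases, in which GNNs make predictions and are updated on a rolling basis. Finally, we propose a scalable and efficient training approach for dynamic GNNs based on incremental training and meta-learning. We conduct experiments on eight different dynamic graph datasets on future link prediction tasks. Under the standard evaluation settings on three datasets, models built with the ROLAND framework achieve, on average, a relative improvement in mean reciprocal rank (MRR). We find that state-of-the-art baselines run into out-of-memory errors on larger datasets, while ROLAND easily scales to dynamic graphs with 56 million edges. After re-implementing these baselines using the ROLAND training strategy, ROLAND models still achieve an average relative improvement of 15.5% over the baselines.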
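Since the abstract reports results as mean reciprocal rank (MRR) on future link prediction, a short sketch of how MRR is typically computed may help. The candidate scores below are invented, and breaking ties optimistically (counting only strictly higher scores) is an assumption of this sketch rather than the paper's evaluation protocol.

```python
def rank_of_true_item(scores, true_index):
    """1-based rank of the true candidate: 1 + number of strictly higher scores."""
    true_score = scores[true_index]
    return 1 + sum(1 for s in scores if s > true_score)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; ranks are 1-based."""
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1.0 / r for r in ranks) / len(ranks)


# Two invented link-prediction queries: candidate scores plus the index of the true edge.
queries = [
    ([0.1, 0.9, 0.4], 2),  # true edge is ranked 2nd
    ([0.8, 0.2], 0),       # true edge is ranked 1st
]
ranks = [rank_of_true_item(scores, idx) for scores, idx in queries]
mrr = mean_reciprocal_rank(ranks)
print(ranks, mrr)
```

A relative MRR improvement, as quoted in the abstract, is then the difference between the model's MRR and the baseline's MRR divided by the baseline's MRR.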
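Text sentiment analysis, also known as opinion mining, is the computational study of people's opinions, appraisals, attitudes, and emotions expressed toward entities. It can be divided into text-level, sentence-level, and aspect-level sentiment analysis. Aspect-based sentiment analysis (ABSA) is a fine-grained task in the field of sentiment analysis that aims to predict the polarity of individual aspects. Research on pre-trained neural models has significantly improved performance on many natural language processing tasks. In recent years, pre-trained models (PTMs) have been applied to ABSA, which raises the question of whether PTMs contain sufficient syntactic information for ABSA. In this paper, we explore the recent DeBERTa model (Decoding-enhanced BERT with Disentangled Attention) to solve the aspect-based sentiment analysis problem. DeBERTa is a Transformer-based neural language model that is pre-trained on a large raw text corpus using self-supervised learning. Based on the Local Context Focus (LCF) mechanism, we integrate the DeBERTa model to build a multi-task learning model for aspect-based sentiment analysis. Experimental results on the most commonly used laptop and restaurant datasets of SemEval-2014 and on the ACL Twitter dataset show that the LCF mechanism with DeBERTa yields a significant improvement.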
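Deep neural networks (DNNs) have demonstrated strong performance in various domains. However, their use raises social concerns when they are applied to sensitive domains involving the allocation of valuable resources, such as education, loans, and employment. Before DNNs can be reliably deployed in such sensitive domains, it is crucial to conduct fairness testing, i.e., to generate as many instances as possible that uncover fairness violations. However, existing testing methods remain limited in three respects: interpretability, performance, and generalizability. To overcome these challenges, we propose NeuronFair, a new DNN fairness testing framework that differs from previous work in several key aspects: (1) interpretable: it quantitatively interprets the fairness violations behind DNNs' biased decisions; (2) effective: it uses the interpretation results to guide the generation of more diverse instances in less time; (3) generic: it can handle both structured and unstructured data. Extensive evaluations across 7 datasets and the corresponding DNNs demonstrate the superiority of NeuronFair. For instance, on structured datasets, it generates many more instances (~x5.84) and saves more time (with an average speedup of 534.56%) compared with state-of-the-art methods. In addition, the instances generated by NeuronFair can be leveraged to improve the fairness of biased DNNs, which helps build fairer and more trustworthy deep learning systems.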
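To make "instances that uncover fairness violations" concrete, the sketch below checks whether a single input is an individual discriminatory instance, that is, whether changing only the sensitive attribute changes the model's decision. This is the common definition used in fairness testing and not the framework's neuron-guided generation procedure; `toy_biased_model`, its feature layout, and its threshold are hypothetical stand-ins for a trained DNN.

```python
def is_discriminatory(predict, instance, sensitive_index, sensitive_values):
    """True if altering only the sensitive attribute flips the prediction."""
    base = predict(instance)
    for value in sensitive_values:
        if value == instance[sensitive_index]:
            continue  # skip the instance's own sensitive value
        variant = list(instance)
        variant[sensitive_index] = value
        if predict(variant) != base:
            return True
    return False


def toy_biased_model(x):
    """Hypothetical classifier over [income, gender]; gender unfairly shifts the score."""
    income, gender = x
    return int(income + 2 * gender >= 5)


# Feature 1 (gender) is the sensitive attribute, with possible values 0 and 1.
near_boundary = [4, 0]    # decision flips when gender changes
far_from_boundary = [9, 0]  # decision is the same for either gender

print(is_discriminatory(toy_biased_model, near_boundary, 1, [0, 1]))
print(is_discriminatory(toy_biased_model, far_from_boundary, 1, [0, 1]))
```

A testing framework would search the input space for as many such instances as possible; the check above is only the oracle that decides whether a candidate counts.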
Multimodal deep learning has been used to predict clinical endpoints and diagnoses from clinical routine data. However, these models suffer from scaling issues: they have to learn pairwise interactions between each piece of information in each data type, thereby escalating model complexity beyond manageable scales. This has so far precluded a widespread use of multimodal deep learning. Here, we present a new technical approach of "learnable synergies", in which the model only selects relevant interactions between data modalities and keeps an "internal memory" of relevant data. Our approach is easily scalable and naturally adapts to multimodal data inputs from clinical routine. We demonstrate this approach on three large multimodal datasets from radiology and ophthalmology and show that it outperforms state-of-the-art models in clinically relevant diagnosis tasks. Our new approach is transferable and will allow the application of multimodal deep learning to a broad set of clinically relevant problems.
Most recent studies on neural constituency parsing focus on encoder structures, while few developments are devoted to decoders. Previous research has demonstrated that probabilistic statistical methods based on syntactic rules are particularly effective in constituency parsing, whereas syntactic rules are not used during the training of neural models in prior work probably due to their enormous computation requirements. In this paper, we first implement a fast CKY decoding procedure harnessing GPU acceleration, based on which we further derive a syntactic rule-based (rule-constrained) CKY decoding. In the experiments, our method obtains 95.89 and 92.52 F1 on the datasets of PTB and CTB respectively, which shows significant improvements compared with previous approaches. Besides, our parser achieves strong and competitive cross-domain performance in zero-shot settings.
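For readers unfamiliar with CKY decoding, the following is a minimal CPU sketch of the classic probabilistic CKY algorithm, in which a span may only receive a label if a grammar rule licenses the combination of its two children. It is not the GPU-accelerated or neural decoder from the abstract: the toy grammar, its probabilities, and the function names are assumptions made for illustration.

```python
import math

# Hypothetical toy grammar in Chomsky normal form: (left, right) -> [(parent, prob)].
BINARY = {
    ("NP", "VP"): [("S", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
    ("Det", "N"): [("NP", 0.6)],
}
# Hypothetical lexicon: word -> [(label, prob)].
LEXICAL = {
    "she": [("NP", 0.4)],
    "eats": [("V", 1.0)],
    "the": [("Det", 1.0)],
    "fish": [("N", 1.0)],
}


def cky(words):
    """Fill chart[i][j]: label -> (best log-prob, backpointer) for span words[i:j]."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for label, prob in LEXICAL.get(word, []):
            chart[i][i + 1][label] = (math.log(prob), word)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point
                for left, (left_score, _) in chart[i][k].items():
                    for right, (right_score, _) in chart[k][j].items():
                        # Rule constraint: only licensed (left, right) pairs combine.
                        for parent, prob in BINARY.get((left, right), []):
                            score = math.log(prob) + left_score + right_score
                            best = chart[i][j].get(parent)
                            if best is None or score > best[0]:
                                chart[i][j][parent] = (score, (k, left, right))
    return chart


def build_tree(chart, i, j, label):
    """Follow backpointers to render the best tree as a bracketed string."""
    _, back = chart[i][j][label]
    if isinstance(back, str):  # lexical entry: backpointer is the word itself
        return f"({label} {back})"
    k, left, right = back
    return f"({label} {build_tree(chart, i, k, left)} {build_tree(chart, k, j, right)})"


words = "she eats the fish".split()
chart = cky(words)
tree = build_tree(chart, 0, len(words), "S")
print(tree)
```

The three nested loops over span length, start position, and split point give the familiar cubic cost in sentence length, which is the part a GPU implementation would parallelize across spans of the same length.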
The success of deep learning applications critically depends on the quality and scale of the underlying training data. Generative adversarial networks (GANs) can generate arbitrarily large datasets, but their diversity and fidelity are limited. This limitation has recently been addressed by denoising diffusion probabilistic models (DDPMs), whose superiority has been demonstrated on natural images. In this study, we propose Medfusion, a conditional latent DDPM for medical images. We compare our DDPM-based model against GAN-based models, which constitute the current state of the art in the medical domain. Medfusion was trained and compared with (i) StyleGan-3 on n=101,442 images from the AIROGS challenge dataset to generate fundoscopies with and without glaucoma, (ii) ProGAN on n=191,027 images from the CheXpert dataset to generate radiographs with and without cardiomegaly, and (iii) wGAN on n=19,557 images from the CRCMS dataset to generate histopathological images with and without microsatellite stability. In the AIROGS, CRCMS, and CheXpert datasets, Medfusion achieved lower (=better) FID than the GANs (11.63 versus 20.43, 30.03 versus 49.26, and 17.28 versus 84.31). Fidelity (precision) and diversity (recall) were also higher (=better) for Medfusion in all three datasets. Our study shows that DDPMs are a superior alternative to GANs for image synthesis in the medical domain.
Harvesting question-answer (QA) pairs from customer service chat logs in the wild is an efficient way to enrich the knowledge base of customer service chatbots in cold-start or continuous-integration scenarios. Prior work attempts to obtain 1-to-1 QA pairs from growing customer service chat logs, which fails to integrate incomplete utterances from the dialog context for composite QA retrieval. In this paper, we propose the N-to-N QA extraction task, in which the derived questions and corresponding answers may be spread across different utterances. We introduce a suite of generative and discriminative tagging-based methods with end-to-end and two-stage variants that perform well on 5 customer service datasets, and for the first time set up a benchmark for N-to-N DialogQAE with utterance-level and session-level evaluation metrics. With a deep dive into the extracted QA pairs, we find that the relations between and within QA pairs can serve as indicators for analyzing dialogue structure, e.g., information seeking, clarification, barge-in, and elaboration. We also show that the proposed models can adapt to different domains and languages, and reduce the labor cost of knowledge accumulation in a real-world product dialogue platform.